The dataset chosen for this project was the Red Wine Quality data provided by Udacity, loaded as a CSV file.
The variable X appears to be simply an index, with no indication that it is ordered in any way; presumably it provides a unique identifier for wines with the same characteristics.
There are multiple measures of acidity, along with other chemical attributes, all expressed as floating-point numbers. From the summary statistics, some initial observations about the typical attributes of red wine can be made:
Red wine is acidic, with a mean pH of 3.31, closely matched to its median and with a short, symmetric IQR. This indicates a tight, well-behaved (likely normal) distribution around the mean pH.
There is no indication of the units used for the chemical constituents; they are probably ppm for chlorides, sulfur dioxide and sulphates, though residual sugar may be expressed as a percentage, just as alcohol is.
The summary statistics for residual sugar indicate a long tail, or at least outliers, at the upper end of the distribution: the mean is higher than the median, and the maximum is very high relative to the median and the quartile positions. This may be the result of dessert wines included in the data set.
Similar properties can be seen in the summary statistics for alcohol percentage, though the maximum is far less extreme, which suggests an asymmetric distribution more strongly than the presence of outliers.
Finally, the quality parameter looks to be one of the key variables of interest. It is an integer value ranging from 3 to 8, with a mean of 5.64, close to the midpoint of that range (5.5). The median (6) does not match the mean, but it is not clear whether this is a symptom of the discrete nature of the variable.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## [1] "Ratio of range of 50%-75% quantile range to 25%-50% quantile range showing assymetry of distrbution, skewed towards the higher values of Residual Sugar. "
## 75%
## 1.333333
## [1] "Ratio of range of 50%-75% quantile range to 25%-50% quantile range showing assymetry of distrbution, skewed towards the higher values of Alchohol. "
## 75%
## 1.285714
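The quartile-range ratios printed above can be reproduced with a few lines of base R. The sketch below is illustrative only: the helper name `quartile_ratio` and the toy vector are assumptions, not the code used for this report, and the last two lines simply redo the arithmetic from the quartiles shown in the summary table.

```r
# Ratio of the upper half of the IQR (median to Q3) to the lower half
# (Q1 to median). A value above 1 suggests the middle of the distribution
# is stretched towards higher values.
quartile_ratio <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), na.rm = TRUE, names = FALSE)
  (q[3] - q[2]) / (q[2] - q[1])
}

# Toy vector (made up): its quartiles are 1, 2 and 4, so the ratio is 2.
toy <- c(0, 1, 2, 4, 8)
toy_ratio <- quartile_ratio(toy)

# The same arithmetic applied to the quartiles reported in the summary table.
sugar_ratio   <- (2.6 - 2.2) / (2.2 - 1.9)     # residual sugar
alcohol_ratio <- (11.1 - 10.2) / (10.2 - 9.5)  # alcohol
```

A ratio of exactly 1 would indicate a symmetric interquartile range; it says nothing about the tails, so it complements rather than replaces the comparison of mean, median and maximum.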
From the plot of quality above it can be seen that most red wines are of quality value 5 or 6.
## [1] 1 1599
An initial question concerned the nature of the variable X; the range above (1 to 1599, matching the 1599 observations) is consistent with it being an index.
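A range check alone does not rule out gaps or duplicates, so a slightly stricter check may be worthwhile. This is a minimal sketch on a made-up integer vector standing in for the X column; the variable names are illustrative.

```r
# Toy stand-in for the X column (the real column runs from 1 to 1599).
X <- c(1L, 2L, 3L, 4L, 5L)

index_range   <- range(X)                    # smallest and largest value
is_unique     <- anyDuplicated(X) == 0       # TRUE when no value repeats
is_sequential <- identical(X, seq_along(X))  # TRUE when X is exactly 1, 2, ..., n
```

If all three checks hold on the real data, X carries no information beyond row position and can safely be left out of the analysis.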
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] "Mean pH"
## # A tibble: 6 × 2
## quality avg
## <int> <dbl>
## 1 3 3.398000
## 2 4 3.381509
## 3 5 3.304949
## 4 6 3.318072
## 5 7 3.290754
## 6 8 3.267222
## [1] "Median pH"
## # A tibble: 6 × 2
## quality median
## <int> <dbl>
## 1 3 3.39
## 2 4 3.37
## 3 5 3.30
## 4 6 3.32
## 5 7 3.28
## 6 8 3.23
The distribution for pH appears normal, centered around a mean of 3.31 as discussed above.
Faceting by quality does not yield any obvious changes to the distribution. There is a difference in mean and median pH between the best and worst quality wines, though it is unclear at this stage whether this is significant.
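The per-quality means and medians shown above can be sketched in base R as follows. The tibble output suggests the original used a grouped summary from another package; `tapply` is swapped in here only to keep the example dependency-free, and the data frame is a made-up stand-in for the wine data.

```r
# Toy stand-in for the wine data (values are invented for illustration).
wine <- data.frame(
  quality = c(3, 3, 5, 5, 5, 8, 8),
  pH      = c(3.40, 3.38, 3.30, 3.31, 3.29, 3.20, 3.26)
)

# Mean and median pH within each quality level.
avg_ph    <- tapply(wine$pH, wine$quality, mean)
median_ph <- tapply(wine$pH, wine$quality, median)

ph_by_quality <- data.frame(
  quality = as.integer(names(avg_ph)),
  avg     = as.vector(avg_ph),
  median  = as.vector(median_ph)
)
```

Comparing the two columns side by side is a quick way to spot quality levels where a few extreme wines pull the mean away from the median.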
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar has a roughly normal distribution but with a substantial tail at the high end. Applying a square root scale contracts the values into a more normal-looking curve.
It is not at all clear what unit this variable uses. Some quick googling suggests residual sugar is typically given in g/L, but with typical values of 2-20 g/L this does not fit the data distribution. It appears residual sugar here is expressed as a percentage (most likely wt %), with values ranging from the driest reds at 0.9% to dessert wines at 15%.
Creating a category for wine types based on sweetness might be appropriate here for separating different styles in future analysis. A basic categorization was found at the source below and modified for this data set (it is a bit crude, and Dry wines look suspiciously sparse, but it will at least be useful for subsetting out the outliers).
Source of residual sugars information: https://www.winecurmudgeon.com/residual-sugar-in-wine-with-charts-and-graphs/
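A categorization like this can be built with `cut()`. The sketch below is only an illustration: the cut points and the "Medium" and "Sweet" labels are hypothetical (the thresholds actually used are not stated here), and the residual sugar values are a handful of made-up examples within the observed range.

```r
# Hypothetical cut points and labels; only "Dry" and "Off Dry" appear in
# the output of this report, the rest are assumptions for illustration.
sweetness_breaks <- c(0, 1, 3.5, 5, Inf)
sweetness_labels <- c("Dry", "Off Dry", "Medium", "Sweet")

# A few example residual sugar values spanning the observed range.
residual_sugar <- c(0.9, 1.9, 2.2, 4.0, 15.5)

# Intervals are closed on the right: (0, 1], (1, 3.5], (3.5, 5], (5, Inf].
sweetness <- cut(residual_sugar,
                 breaks = sweetness_breaks,
                 labels = sweetness_labels,
                 right  = TRUE)

sweetness_counts <- table(sweetness)
```

Tabulating the result straight away makes sparse categories, such as the Dry group noted above, easy to spot before they are used for subsetting.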
The alcohol content of the wines follows an unusual distribution, with few wines below 9% by volume and then a very long, fat tail up to about 13%. Applying several scales to this histogram failed to provide any new insights, as did breaking the histograms out by quality bin. The peak at 9.5% is most apparent in quality 5 wines. It is possible that this distribution reflects the most popular styles of mid-range wines and says more about trends in wine consumption than about quality.
Density appears normally distributed, though splitting by quality seems to produce two slightly skewed distributions for the most populated wine qualities (5 and 6). Breaking the mean and median densities out by quality appears to show an overall downward trend in density with increasing quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Mean Volatile Acidity"
## # A tibble: 6 × 2
## quality avg
## <int> <dbl>
## 1 3 0.8845000
## 2 4 0.6939623
## 3 5 0.5770411
## 4 6 0.4974843
## 5 7 0.4039196
## 6 8 0.4233333
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 127 127 8.2 1.33 0.00 1.7
## 128 128 8.1 1.33 0.00 1.8
## 673 673 9.8 1.24 0.34 2.0
## 1300 1300 7.6 1.58 0.00 2.1
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 127 0.081 3 12 0.99640 3.53
## 128 0.082 3 12 0.99640 3.54
## 673 0.079 32 151 0.99800 3.15
## 1300 0.137 5 9 0.99476 3.50
## sulphates alcohol quality Sweetness
## 127 0.49 10.9 5 Off Dry
## 128 0.48 10.9 5 Off Dry
## 673 0.53 9.5 5 Off Dry
## 1300 0.40 10.9 3 Off Dry
Fixed acidity has a long tail towards higher acidity, and applying new scales does not seem to resolve this.
Volatile acidity is also complex, with a possible bimodality hidden in the distribution and a long tail towards higher acidity. Breaking the distributions out by quality makes the bimodality slightly more pronounced.
Information found online (http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity) suggests volatile acidity is a measure of wine spoilage, as it largely corresponds to acetic acid levels in the wine. The mean volatile acidity of the lowest quality wines is higher than that of better quality wines, but this relationship breaks down for the highest quality wines (the mean rises slightly from 0.404 at quality 7 to 0.423 at quality 8).
Interestingly, there are wines in the data set with volatile acidity beyond the US legal limit listed at that source; these are the rows shown above.
## n
## 1 0.08255159
The citric acid distribution is fairly flat, with a high volume of zeros (about 8% of wines) and another spike at 0.5. Separating out the zero results shows that none of the wines in the highest quality bucket has zero citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04100 0.07000 0.07850 0.07949 0.08800 0.12800
## Observations: 1
## Variables: 14
## $ X <dbl> -0.1960803
## $ fixed.acidity <dbl> 0.05215707
## $ volatile.acidity <dbl> 0.1372031
## $ citric.acid <dbl> 0.6102384
## $ residual.sugar <dbl> 0.3249878
## $ chlorides <dbl> 3.804413
## $ free.sulfur.dioxide <dbl> 0.01552478
## $ total.sulfur.dioxide <dbl> 0.04975434
## $ density <dbl> 0.3543852
## $ pH <dbl> -0.755478
## $ sulphates <dbl> 1.349306
## $ alcohol <dbl> -0.6938383
## $ quality <dbl> -0.344012
## $ Sweetness <dbl> NA
The chlorides content looks to have a normal distribution near the bottom of the scale, but this is obscured by a very long tail of outliers. Shrinking the limits gives a more normal distribution centred around a mean of ~0.08.
It would be interesting to know the properties of the high chloride outliers and whether there are any clues as to what causes such high chloride content.
The table HC.delta gives the difference between the standardized mean properties of the high chloride wines and those of the original complete data set. Standardization was applied to make the deltas comparable across all the parameters, though it runs the risk of obscuring the scale of some deltas where the standard deviation is disproportionately high due to outliers.
From the table it can be seen that the high chloride wines are also generally high in sulphates. They also appear higher in citric acid, residual sugar and density, and lower in pH, alcohol and quality. A statistical test such as a t-test would be useful to further investigate the significance of these differences if needed.
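One way to compute standardized deltas of this kind is sketched below. It is a minimal illustration under stated assumptions: the data frame is made up, the 0.2 chloride cutoff is hypothetical, and the approach (z-scoring every column, then comparing subset means with overall means) is one reading of the description above rather than the exact code behind HC.delta.

```r
# Toy stand-in for the wine data: six ordinary wines and two made-up
# high chloride wines.
wine <- data.frame(
  chlorides = c(0.07, 0.08, 0.09, 0.08, 0.07, 0.09, 0.40, 0.45),
  sulphates = c(0.55, 0.60, 0.62, 0.58, 0.65, 0.60, 1.10, 1.30)
)

# Standardize every column to mean 0 and standard deviation 1.
z <- scale(wine)

# Hypothetical cutoff separating the high chloride wines.
high <- wine$chlorides > 0.2

# Mean of the standardized high chloride subset minus the overall
# standardized mean (which is 0 by construction).
delta <- colMeans(z[high, , drop = FALSE]) - colMeans(z)
```

Because each delta is expressed in standard deviations of the full data set, a value such as 3.8 for chlorides reads directly as "3.8 standard deviations above the overall mean".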
There are two measures of sulfur dioxide in the data set, free and total. Both appear to have extremely long-tailed distributions, but applying a log10 scale to the plots gave a more normal distribution. Faceting total sulfur dioxide by quality shows no major shifts in the distribution, though a slight hint of bimodality can be seen for quality = 6.
Assuming the free sulfur dioxide is a subset of the total, a new variable for the non-free sulfur dioxide was created to investigate whether there is anything of interest in the difference between the two variables. The non-free distribution is similar to the other measures of sulfur dioxide, with a log scale giving a more normal distribution.
Sulphates is another long-tailed distribution where applying a log10 scale gives a far more normal-looking plot. Applying a square root scale also helps but is less effective. Faceting over quality shows no major differences between the distributions for different quality levels.
Finally, due to the similarity in distributions and the hint of connections between the variables, a new variable (‘Total.additives’) was created as the simple sum of sulphates and chlorides.
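The derived variables described above amount to simple column arithmetic. The sketch below uses the first three rows from the structure listing at the top of this report; the column name `log10.total.sulfur.dioxide` is an illustrative assumption, since the report applied log10 as a plot scale rather than storing a new column.

```r
# First three observations, taken from the str() listing above.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54),
  sulphates            = c(0.56, 0.68, 0.65),
  chlorides            = c(0.076, 0.098, 0.092)
)

# Non-free sulfur dioxide, assuming free is a subset of total.
wine$nonfree.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Simple sum of sulphates and chlorides.
wine$Total.additives <- wine$sulphates + wine$chlorides

# Optional: store the log10 transform instead of only rescaling the plot axis.
wine$log10.total.sulfur.dioxide <- log10(wine$total.sulfur.dioxide)
```

A negative value in nonfree.sulfur.dioxide would contradict the assumption that free sulfur dioxide is a subset of the total, so the minimum of that column is worth checking on the full data.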
The original data set consists of 13 variables: an index (X), 11 chemical properties of red wines, and quality. Except for the index and quality, which are expressed as integers, all the parameters are expressed as floating-point numbers.
The main feature of the data set is the assigned quality of the wines. A key question of this investigation is how the other properties of the wines influence this quality score.
Other important features are the residual sugar, alcohol and pH variables, as these are likely to have the biggest impact on the properties of the wine.
New variables Sweetness, nonfree.sulfur.dioxide and Total.additives were created.
Several variables (sulphates and both sulfur dioxide measures) look approximately normal on a log10 scale, and residual sugar looks closer to normal on a square root scale.
No changes were made to adjust the data at this stage. It may be useful to convert quality to a factor, but so far it has been kept as an integer for ease of plotting.
##
## Pearson's product-moment correlation
##
## data: QR$quality and QR$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
From the plotted variables, some positive correlation between sulphates and citric acid can be seen.
A negative correlation between pH and citric acid is seen; citric acid is likely one of the several acidity indicators that contribute to pH. The negative correlation makes sense, as pH decreases with increasing acidity.
The strongest correlation with quality in the second GGally matrix is the negative correlation with volatile acidity (-0.391). This correlation test only measures linear dependence, so focusing on these variables may exclude non-linear relationships, but that is beyond the scope of this first pass at the data set. The top 4 influencers of wine quality bucket are:
volatile acidity = -0.391
sulphates = 0.251
citric acid = 0.226
total sulfur dioxide = -0.185
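A ranking of this kind can also be produced directly from a correlation matrix rather than read off a plot. The following is a minimal sketch on made-up data; the values and the short column list are illustrative, and only base R functions (`cor`, `cor.test`) are used.

```r
# Toy stand-in for the wine data (values invented for illustration).
wine <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  volatile.acidity = c(0.90, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.42),
  sulphates        = c(0.50, 0.60, 0.55, 0.62, 0.70, 0.66, 0.75, 0.77)
)

# Pearson correlation of every column with quality, dropping quality itself.
r_quality <- cor(wine)[, "quality"]
r_quality <- r_quality[names(r_quality) != "quality"]

# Rank by absolute strength so negative correlations are not pushed last.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

# Full test (confidence interval and p-value) for a single pair.
ct <- cor.test(wine$quality, wine$volatile.acidity)
```

Sorting on the absolute value matters here: the strongest relationship in this report is negative, and a plain descending sort would place it at the bottom of the list.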
Looking at the different measures of acidity, there are clear relationships between some of them. The strongest is a negative correlation between pH and fixed acidity.
There is a correlation between citric acid content and fixed acidity, which suggests that these are not independent measures of acidity and that the fixed acidity of a wine incorporates the contribution of its citric acid content.
## [1] 0.2263725
From the boxplot of citric acid by quality we can see the correlation between citric acid content and quality. There is a slight increase in mean citric acid content between quality buckets 5 and 6, but the big differences appear when comparing the low quality buckets, with low citric acid, against the high quality buckets, which show the reverse.
## Warning: Removed 50 rows containing non-finite values (stat_boxplot).
## Warning: Removed 52 rows containing missing values (geom_point).
## Warning: Removed 127 rows containing non-finite values (stat_smooth).
## Warning: Removed 127 rows containing missing values (geom_point).
## [1] 0.3552834
Looking at residual sugar across the levels of quality, there appears to be no correlation between residual sugar and quality.
A scatter plot of residual sugar against density shows an influence of residual sugar on the density of the wine. This would be expected given the higher density of sugars relative to the other main constituents (water and alcohol), though the correlation breaks down for the sweetest wines.
## Warning: Removed 42 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).
Chloride content looks static along the quality buckets.
##
## Pearson's product-moment correlation
##
## data: QR$quality and QR$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
From the trends in mean and median total sulfur dioxide over the different quality bins, we can see there is no clear negative linear correlation of the kind the R value would suggest.
The clearest differences can be seen in the populations of bins 5, 6 and 7. The upper part of the sulfur dioxide distribution is eroded away, giving a smaller range of sulfur dioxide values as quality increases (ignoring outliers). This trend cannot be seen for the very low or very high quality buckets due to their lower overall populations. It is possible that if the data set were expanded to include far more observations at the low and high ends of the quality scale, the ranges would follow the same pattern as 5, 6 and 7.
##
## Pearson's product-moment correlation
##
## data: QR$quality and QR$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The correlation between wine quality and sulphates appears more clear-cut, with a linear relationship between sulphates and quality evident in the bivariate plot above. The trends in mean and median hold most clearly for the highly populated quality buckets, and the pattern falters for the highest and lowest quality wines. Again, a data set with more low and high quality wines might follow the linear relationship more closely. The value of Pearson’s R (0.25) is in line with the observed positive linear dependence.
##
## Pearson's product-moment correlation
##
## data: QR$quality and QR$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Welch Two Sample t-test
##
## data: x and y
## t = 0, df = 3196, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01241656 0.01241656
## sample estimates:
## mean of x mean of y
## 0.5278205 0.5278205
## Warning: Removed 12 rows containing non-finite values (stat_bin).
The clearest linear correlation between quality and the other variables in the data set appears to be the negative relationship between quality and volatile acidity. This relationship is borne out in both the large negative R value (-0.39) and the bivariate chart. Only for the highest quality wines does the relationship break down, either because of low population or because the difference between quality 7 and 8 wines is not expressed in volatile acidity. A t-test was intended to compare the volatile acidity of quality 7 and 8 wines, but the output above is not a valid comparison: both sample means equal the overall mean of the data set (0.5278), and the degrees of freedom (3196) correspond to two samples of 1599 observations each, so the test appears to have compared the full volatile acidity column with itself. This explains the suspicious p-value of 1. The group means reported earlier (0.404 for quality 7 and 0.423 for quality 8) are close, so it does not seem unreasonable to suppose that volatile acidity stabilizes at quality 7 with no further reduction for higher quality wines, but this would need to be confirmed with a test on the correct subsets.
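A comparison of the two quality groups needs each sample to be subset before it is passed to `t.test`. The sketch below shows the general shape on made-up values; it is not a rerun of the analysis, and the numbers carry no meaning beyond illustration.

```r
# Toy stand-in for the wine data (values invented for illustration).
wine <- data.frame(
  quality          = c(7, 7, 7, 7, 7, 8, 8, 8, 5, 5),
  volatile.acidity = c(0.38, 0.42, 0.40, 0.45, 0.36, 0.44, 0.40, 0.43, 0.60, 0.58)
)

# Subset each group explicitly so the two samples really differ.
va7 <- wine$volatile.acidity[wine$quality == 7]
va8 <- wine$volatile.acidity[wine$quality == 8]

# Welch two-sample t-test (unequal variances is the default in t.test).
welch <- t.test(va7, va8)

group_means <- unname(welch$estimate)  # mean of va7, then mean of va8
```

A quick sanity check on the result is that the reported degrees of freedom should be no larger than the combined size of the two groups minus 2; a value close to twice the full data set, as in the output above, signals that the subsetting did not take effect. With a very small group, a non-significant result should still be read as "no evidence of a difference" rather than proof that the means match.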
There are clear relationships between some variables in the data set. Most striking is the negative relationship between wine quality and volatile acidity. With the exception of the highest quality wines, there is a clear downward trend in quality with increasing volatile acidity. This makes sense given that volatile acidity is composed mainly of acetic acid and other acids considered spoilants that would negatively impact the flavour.
Less important variables impacting quality include citric acid, sulphates and total sulfur dioxide. The relationships between these variables and quality are less straightforward and break down for the lowest and highest quality buckets.
Wine information from: http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity
Other than looking at quality, we can see relationships between citric acid and pH that would be expected (pH being a measure of acidity). A linear relationship between residual sugar and density was also observed, as would be expected given the large molecular weight of sugars.
As mentioned above the strongest relationship was the negative impact of volatile acidity on the wine quality.
## Warning: Removed 5 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: QR$volatile.acidity and QR$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
## Warning: Removed 27 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: QR$volatile.acidity and QR$sulphates
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3060917 -0.2147125
## sample estimates:
## cor
## -0.2609867
## Warning: Removed 23 rows containing missing values (geom_point).
Though not wholly successful, the plan here was to map out areas of highest quality over the variables with the highest impact on wine quality. Though the relationships explored in the previous section were linear, it was clear that for high and low quality wines these relationships often broke down. The aim of the preceding plots was to see whether high or low quality wines cluster in certain intersections of two variables.
##
## Call:
## lm(formula = quality ~ I(-volatile.acidity) + I(-citric.acid) +
## sulphates, data = QR.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1900 -0.5048 -0.0554 0.4749 2.4065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.042170 0.150393 20.228 <2e-16 ***
## I(-volatile.acidity) 1.264746 0.134781 9.384 <2e-16 ***
## I(-citric.acid) 0.003739 0.124405 0.030 0.976
## sulphates 1.983579 0.168391 11.780 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7108 on 1403 degrees of freedom
## Multiple R-squared: 0.2158, Adjusted R-squared: 0.2141
## F-statistic: 128.7 on 3 and 1403 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: quality
## Df Sum Sq Mean Sq F value Pr(>F)
## I(-volatile.acidity) 1 123.52 123.516 244.4727 < 2e-16 ***
## I(-citric.acid) 1 1.46 1.460 2.8899 0.08936 .
## sulphates 1 70.11 70.106 138.7598 < 2e-16 ***
## Residuals 1403 708.85 0.505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Calls:
## lm1: lm(formula = quality ~ I(-volatile.acidity), data = QR.sub)
## lm2: lm(formula = quality ~ I(-volatile.acidity) + I(-citric.acid),
## data = QR.sub)
## lm3: lm(formula = quality ~ I(-volatile.acidity) + I(-citric.acid) +
## sulphates, data = QR.sub)
## lm4: lm(formula = quality ~ I(-volatile.acidity) + I(-citric.acid) +
## sulphates + I(-density), data = QR.sub)
##
## =====================================================================
## lm1 lm2 lm3 lm4
## ---------------------------------------------------------------------
## (Intercept) 4.550*** 4.428*** 3.042*** 116.944***
## (0.063) (0.098) (0.150) (10.928)
## I(-volatile.acidity) 1.734*** 1.613*** 1.265*** 0.803***
## (0.116) (0.138) (0.135) (0.137)
## I(-citric.acid) -0.209 0.004 -0.616***
## (0.129) (0.124) (0.134)
## sulphates 1.984*** 2.181***
## (0.168) (0.163)
## I(-density) 114.813***
## (11.015)
## ---------------------------------------------------------------------
## R-squared 0.14 0.14 0.22 0.27
## adj. R-squared 0.14 0.14 0.21 0.27
## sigma 0.75 0.74 0.71 0.69
## F 222.37 112.63 128.71 131.10
## p 0.00 0.00 0.00 0.00
## Log-likelihood -1581.81 -1580.49 -1514.14 -1461.63
## Deviance 780.41 778.95 708.85 657.86
## AIC 3169.62 3168.98 3038.29 2935.27
## BIC 3185.37 3189.98 3064.53 2966.76
## N 1407 1407 1407 1407
## =====================================================================
Plots were generated to look for potential models. A linear model was generated, but with very limited fit.
For citric acid and volatile acidity, it is clear that the majority of the highest quality wines have both high citric acid and low volatile acidity. The high quality wines that do not have low volatile acidity have low citric acid, though this may just be a consequence of the clear inverse linear relationship between citric acid and volatile acidity. A correlation test confirms this relationship with an R value of -0.55.
Looking at sulphates versus volatile acidity, the highest quality wines are most prominent in the region of low volatile acidity and high sulphate content. The separation between high and low quality wines can be seen clearly, with the majority of low and mid quality wines clustering in the area of higher volatile acidity and lower sulphate content. In this case there is again an inverse linear relationship visible between sulphate and volatile acidity content, and a Pearson’s R value of -0.26 confirms this weaker correlation.
Finally, for sulphates versus citric acid, the highest quality wines cluster in the area of high sulphate content. The separation by sulphate content is more clear-cut than by citric acid: some higher quality wines are visible at lower citric acid, but far fewer high quality wines appear in the low sulphates region (though with a tighter distribution of values). In this case there is no strong relationship between the two variables.
As detailed above, there was an unexpected inverse relationship between citric acid and volatile acidity. It is not clear why one measure of acidity would negatively affect another, and this analysis was not aimed at finding an explanation. There was also a weaker, unexpected correlation between sulphates and volatile acidity, though the effect is not as clear.
An attempt was made to generate a linear model for wine quality using the volatile acidity, citric acid and sulphate content of the wines. A subset of the data set was used to exclude outliers and simplify the model. The results show this is a somewhat weak model, with a low R-squared value of 0.22 for the completed model; citric acid contributes almost nothing once the other two variables are included (p = 0.976). This model could serve as a basic starting point for a more complex model of red wine quality, but could not be relied upon to accurately predict wine quality from the variables examined.
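The stepwise model building summarized in the table above can be sketched with nested `lm` fits. The example below runs on simulated data, so the coefficients are meaningless; the simulated relationship, sample size and seed are all assumptions. The predictors are entered without the sign flip used in the report's formulas (for example I(-volatile.acidity)), which only reverses the sign of the corresponding coefficient.

```r
# Simulated stand-in for the wine subset (not the real data).
set.seed(42)
n <- 200
volatile.acidity <- runif(n, 0.2, 1.2)
citric.acid      <- runif(n, 0.0, 0.8)
sulphates        <- runif(n, 0.4, 1.0)
quality <- 5 - 1.5 * volatile.acidity + 2 * sulphates + rnorm(n, sd = 0.7)
wine <- data.frame(quality, volatile.acidity, citric.acid, sulphates)

# Nested models, adding one predictor at a time.
lm1 <- lm(quality ~ volatile.acidity, data = wine)
lm2 <- update(lm1, . ~ . + citric.acid)
lm3 <- update(lm2, . ~ . + sulphates)

# R-squared for each model; it cannot decrease as predictors are added.
r2 <- sapply(list(lm1 = lm1, lm2 = lm2, lm3 = lm3),
             function(m) summary(m)$r.squared)

# F-tests for whether each added predictor improves the fit.
model_comparison <- anova(lm1, lm2, lm3)
```

Because R-squared never falls when a predictor is added, the adjusted R-squared or the F-tests from `anova` are the better guide to whether a term such as citric acid earns its place in the model.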
The graph above shows the impact of citric acid content on the quality of red wines, with median citric acid content increasing with quality value. Displayed above is the Pearson’s R value for citric acid content versus quality value, indicating a small positive correlation between citric acid and the quality of red wine.
This chart shows the relationship between volatile acidity and the quality of red wines, with the value of Pearson’s R displayed above. The negative impact on quality is clear from both the negative R value and the clear trend in mean and median volatile acidity.
## Warning: Removed 22 rows containing missing values (geom_point).
This plot maps out the areas of highest quality red wines over the two variables that most influence red wine quality. From this plot it can be seen that the highest quality red wines are clustered in the area of high citric acid and low volatile acidity.
The red wines data set contains a lot of interesting information on red wines and their chemical properties. Though there are multiple interesting relationships between these chemical indicators, the most interesting question that can be asked of this data is how these physical attributes play into making a wine of high or low quality. A strong negative relationship between volatile acidity and quality was observed, and a linear model based on this and other correlated variables was created (though with limited predictive power).
Getting a clear picture of what determines quality in red wines from this data set has proved challenging for several reasons. Firstly, the vast majority of the wines fall into the two centremost quality values of 5 and 6. As a result, many relationships with wine quality break down at either end of the scale, and building robust models of wine quality is hampered by this tight distribution. An associated issue has been the low number of unique values for quality; this limits the granularity of the analysis, though treating quality as a factor allows analysis that is not possible with more continuous variables. The data set does not include any measure of wine price, which would greatly increase the number of interesting questions that could be asked. Within each quality grouping there is a high degree of overlap in the values of each of the other variables, which makes it hard to discern clear patterns in their impact on quality.
For next steps, the linear model could be expanded to include other variables with weaker correlation to quality, or redesigned completely. A simpler but possibly more informative way of comparing the attributes of good and bad quality wines may be to bucket wines of quality 3-5 and 6-8 into two groups and compare their summary statistics to find out what makes the high quality group different from the low quality group. There may also be a way to combine multiple chemical properties of the wine into a single variable with greater predictive power for wine quality. Some attempts at this were made above, but with no success.
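The two-bucket comparison suggested above could look something like the following. This is a sketch on made-up values: the column name `quality.group`, the "low"/"high" labels and the choice of variables are illustrative assumptions.

```r
# Toy stand-in for the wine data (values invented for illustration).
wine <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  volatile.acidity = c(0.90, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.42),
  sulphates        = c(0.50, 0.60, 0.55, 0.62, 0.70, 0.66, 0.75, 0.77)
)

# Bucket quality 3-5 as "low" and 6-8 as "high".
wine$quality.group <- ifelse(wine$quality <= 5, "low", "high")

# Mean of each chemical property within each bucket.
group_means <- aggregate(
  cbind(volatile.acidity, sulphates) ~ quality.group,
  data = wine,
  FUN  = mean
)
```

Collapsing six sparse quality levels into two well-populated groups trades granularity for sample size, which may make differences easier to test given how few wines sit at the extremes.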